WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning

Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc

doi:10.1145/3404835.3463257

arXiv:2103.01913 (cs)

[Submitted on 2 Mar 2021 (v1), last revised 3 Mar 2021 (this version, v2)]

Abstract: The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage large high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pre-training dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset by the number of image-text examples by 3x (at the time of writing). Second, WIT is massively multilingual (first of its kind) with coverage over 100+ languages (each of which has at least 12K examples) and provides cross-lingual texts for many images. Third, WIT represents a more diverse set of concepts and real-world entities relative to what previous datasets cover. Lastly, WIT provides a very challenging real-world test set, as we empirically illustrate using an image-text retrieval task as an example.
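The abstract names image-text retrieval as the downstream task but does not say how it is scored. As a minimal sketch, assuming recall@K (a common retrieval metric, not confirmed by the abstract) and a toy score matrix with made-up values, the evaluation could look like the following. The function name `recall_at_k` and the convention that text i matches image i are illustrative assumptions, not part of WIT.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of text queries whose matching image ranks in the top k.

    Assumes similarity[i][j] scores text i against image j, and that the
    correct image for text i is image i (a toy convention for this sketch).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not similarity:
        raise ValueError("similarity must contain at least one query")
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Rank image indices by descending similarity score.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy scores for three texts and three images (made-up values).
toy_similarity = [
    [0.9, 0.1, 0.3],  # text 0: correct image ranked first
    [0.2, 0.4, 0.8],  # text 1: correct image ranked second
    [0.7, 0.6, 0.1],  # text 2: correct image ranked last
]

print(recall_at_k(toy_similarity, k=1))
print(recall_at_k(toy_similarity, k=2))
print(recall_at_k(toy_similarity, k=3))
```

In this toy case one of the three queries is answered correctly at rank 1, two within the top 2, and all three within the top 3. In practice the scores would come from a trained image-text model rather than being written by hand.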

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL); Information Retrieval (cs.IR)
Cite as:	arXiv:2103.01913 [cs.CV]
	(or arXiv:2103.01913v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2103.01913
Related DOI:	https://doi.org/10.1145/3404835.3463257

Submission history

From: Krishna Srinivasan
[v1] Tue, 2 Mar 2021 18:13:54 UTC (9,949 KB)
[v2] Wed, 3 Mar 2021 16:41:01 UTC (6,584 KB)
